# +~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~ #  
#
#' @title  Reduce dimensionality of tweet LASER embeddings
#' @author Hauke Licht
#
# +~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~+~ #

# setup ----

# load packages
library(data.table)
library(readr)
library(dplyr)
library(fastICA)

base_path <- file.path(".")
data_path <- file.path(base_path, "data")

# load tweet embeddings ----

# CAUTION: the file is very large and cannot be shared on Dataverse (contact the author)
fp <- file.path(data_path, "intermediate", "embeddings", "tweet_laser_embeddings.tab")
# CAUTION: the file is very large and reading it requires a lot of RAM
tem <- data.table::fread(fp, sep = "\t", header = TRUE)

table(is.na(tem$id)) # should all be FALSE
table(is.na(tem$lang)) # should all be FALSE
table(is.na(tem$text)) # should all be FALSE
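# NOTE: `tem` is still a data.table here, so `tem[-c(1:3)]` would drop rows, not columns
table(colMeans(is.na(tem[, -c(1:3), with = FALSE]))) # should all be 0 (takes some time to compute though)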

tem <- as.data.frame(tem)
gc()
rownames(tem) <- tem$id

# discard those collected post-hoc in June 2023
# select the two required columns by name rather than by position
tmp <- read_csv(
  file.path(data_path, "input", "all_tweets_texts.csv")
  , col_select = c(id, collected_posthoc)
)
tem <- tem[tmp$id[tmp$collected_posthoc == "no"], ]
nrow(tem)
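# Sanity check (an addition, not in the original script): ids in `tmp` without
#  a matching row in `tem` yield all-NA rows when subsetting as done above,
#  so no `id` should be missing after the filter
table(is.na(tem$id)) # should all be FALSE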

gc()

# extract first 300 independent components ----

ica_path <- file.path(data_path, "fits", "laser_embeddings_ica.rds")

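# NOTE: with `w.init = NULL`, `fastICA` draws the initial un-mixing matrix with
#  `rnorm()`, so estimates vary across runs unless a seed is set beforehand
sts <- Sys.time()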
ic300 <- fastICA(
  X = as.matrix(select(tem, matches("e\\d{4}")))
  , n.comp = 300
  , alg.typ = "parallel"
  , fun = "logcosh"
  , alpha = 1.0
  , method = "C"
  , row.norm = FALSE
  , maxit = 100
  , tol = 1e-04
  , verbose = TRUE
  , w.init = NULL
)
ets <- Sys.time()
str(ic300, 1)

#' `ic300`: object elements
#' 
#' - X: (pre-processed) data matrix
#' - K: pre-whitening matrix that projects data onto the first n.comp principal components.
#' - W: estimated un-mixing matrix s.t. XKW = S (W is chosen to maximize the negentropy approximation)
#' - A: estimated mixing matrix
#' - S: estimated source matrix (the object of interest)
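
# Quick check of the identity stated above (an addition, not in the original
#  script): for the first few rows, X K W is expected to agree with S up to
#  numerical error. `idx` is a helper name introduced only for this check.
idx <- seq_len(min(5, nrow(ic300$S)))
all.equal(
  ic300$X[idx, , drop = FALSE] %*% ic300$K %*% ic300$W
  , ic300$S[idx, , drop = FALSE]
)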

# recover the column means that `fastICA` subtracted when centering
#  - centering is hard-coded in `fastICA`
#  - the means are required to `predict` s[i] for new embedding vectors
#  - with x' the centered row: x' = x - \bar{x} => \bar{x} = x - x'
#  - only the first row is needed, so avoid converting all of `tem` again
tmp <- as.matrix(select(tem[1, ], matches("e\\d{4}")))[1, ] - ic300$X[1, ]
all(grepl("e\\d{4}", names(tmp)))

ic300$X.means <- tmp
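
# A minimal sketch (not part of the original pipeline) of how the stored means
#  could be used to score new embedding vectors via S = (X - \bar{x}) K W.
#  This assumes `row.norm = FALSE`, as in the call above. `predict_ica()` and
#  `x_new` are hypothetical names introduced for illustration only.
predict_ica <- function(fit, x_new) {
  # treat a bare vector as a single observation (one row)
  if (is.null(dim(x_new))) x_new <- matrix(x_new, nrow = 1)
  x_new <- as.matrix(x_new)
  # subtract the training column means from every row, then whiten and un-mix
  x_centered <- sweep(x_new, 2, fit$X.means, FUN = "-")
  x_centered %*% fit$K %*% fit$W
}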

# add runtime and parameters
ic300$runtime <- ets - sts
ic300$params <- list(
  n.comp = 300
  , alg.typ = "parallel"
  , fun = "logcosh"
  , alpha = 1.0
  , method = "C"
  , row.norm = FALSE
  , maxit = 100
  , tol = 1e-04
  , verbose = TRUE
  , w.init = NULL
)

dimnames(ic300$X) <- list(tem$id, paste0("e", 1:1024))
dimnames(ic300$K) <- list(paste0("e", 1:1024), paste0("ic", 1:300))
dimnames(ic300$W) <- list(paste0("ic", 1:300), paste0("ic", 1:300))
dimnames(ic300$A) <- list(paste0("ic", 1:300), paste0("e", 1:1024))
dimnames(ic300$S) <- list(tem$id, paste0("ic", 1:300))
class(ic300) <- c("fastICA", "list")

# # save complete object
# write_rds(ic300, sub("_ica", "_ica_full", ica_path))

# the object is really large, though 
format(object.size(ic300), "Gb")
# because it contains the (centered) model matrix
format(object.size(ic300$X), "Mb")

# remove model matrix
ic300$X <- NULL
gc()
format(object.size(ic300), "Gb")

# save reduced object
if (!file.exists(ica_path))
  write_rds(ic300, ica_path)

rm(ic300); gc()
